2025-02-04
ols() and lm()

Restricted cubic splines can fit many different types of non-linearities. Specifying the number of knots is all you need to do in R to get a reasonable result from a restricted cubic spline.
The most common choices are 3, 4, or 5 knots.
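To see roughly what rcs() does under the hood, here is a minimal base-R sketch of the restricted cubic spline (truncated power) basis, following Harrell's construction up to rms's exact scaling. The function name and knot locations are illustrative, not rms code; note that k knots yield k - 1 basis columns, matching the model df table below.

```r
# Sketch of a restricted cubic spline basis: k knots give k - 1 columns
# (x itself plus k - 2 cubic terms), linear beyond the outer knots.
rcs_basis <- function(x, knots) {
  k <- length(knots)
  cube <- function(u) pmax(u, 0)^3          # truncated cubic, (u)+^3
  scl  <- (knots[k] - knots[1])^2           # normalization, as in rms
  cols <- sapply(seq_len(k - 2), function(j) {
    (cube(x - knots[j]) -
       cube(x - knots[k - 1]) * (knots[k] - knots[j]) / (knots[k] - knots[k - 1]) +
       cube(x - knots[k])     * (knots[k - 1] - knots[j]) / (knots[k] - knots[k - 1])) / scl
  })
  cbind(x = x, cols)
}

x <- seq(0, 100, by = 0.5)
B <- rcs_basis(x, knots = c(10, 30, 50, 70, 90))  # 5 knots -> 4 columns
dim(B)
```

The cubic and quadratic pieces cancel beyond the last knot, which is what keeps the tails of a restricted cubic spline linear.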
sim_data, plotted

p1 <- ggplot(sim_data, aes(x = x, y = y)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", formula = y ~ x,
col = "red", se = FALSE) +
labs(title = "With Linear Fit")
p2 <- ggplot(sim_data, aes(x = x, y = y)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "loess", formula = y ~ x,
col = "blue", se = FALSE) +
labs(title = "With Loess Smooth")
p1 + p2

Models for sim_data

We'll fit:

- a simple linear model,
- quadratic and cubic polynomials (with poly()), and
- restricted cubic splines with 3, 4, and 5 knots (with rcs()).

Degrees of freedom for each fit, obtained with anova(modelname):

| Formula | Model df | Resid. df | # obs. |
|---|---|---|---|
| lm(y ~ x) | 1 | 248 | 250 |
| lm(y ~ poly(x,2)) | 2 | 247 | 250 |
| lm(y ~ poly(x,3)) | 3 | 246 | 250 |
| lm(y ~ rcs(x,3)) | 2 | 247 | 250 |
| lm(y ~ rcs(x,4)) | 3 | 246 | 250 |
| lm(y ~ rcs(x,5)) | 4 | 245 | 250 |
augment() for our six models

augment() generates fitted y predictions and residuals, which will help us plot the fits for our six models.
sim_linear_aug <- augment(sim_linear, sim_data)
sim_poly2_aug <- augment(sim_poly2, sim_data)
sim_poly3_aug <- augment(sim_poly3, sim_data)
sim_rcs3_aug <- augment(sim_rcs3, sim_data)
sim_rcs4_aug <- augment(sim_rcs4, sim_data)
sim_rcs5_aug <- augment(sim_rcs5, sim_data)
sim_linear_aug |> slice(1:2) |>
gt() |> fmt_number(decimals = 3) |> tab_options(table.font.size = 20)

| x | y | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|
| 19.635 | 430.407 | 304.638 | 125.769 | 0.008 | 104.004 | 0.006 | 1.213 |
| 43.699 | 666.342 | 669.228 | −2.886 | 0.009 | 104.313 | 0.000 | −0.028 |
p1 <- ggplot(sim_data, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = y ~ x,
col = "black", se = FALSE) +
labs(title = "Linear Fit")
p2 <- ggplot(sim_data, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", formula = y ~ x,
col = "forestgreen", se = FALSE) +
labs(title = "Loess Smooth")
p3 <- ggplot(sim_poly2_aug, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_line(aes(x = x, y = .fitted),
col = "blue", linewidth = 1.25) +
labs(title = "Quadratic Polynomial")
p4 <- ggplot(sim_poly3_aug, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_line(aes(x = x, y = .fitted),
col = "purple", linewidth = 1.25) +
labs(title = "Cubic Polynomial")
(p1 + p2) / (p3 + p4)

p0 <- ggplot(sim_data, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = y ~ x,
col = "black", se = FALSE) +
labs(title = "Linear Fit")
p3 <- ggplot(sim_rcs3_aug, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_line(aes(x = x, y = .fitted),
col = "blue", linewidth = 1.25) +
labs(title = "RCS with 3 knots")
p4 <- ggplot(sim_rcs4_aug, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_line(aes(x = x, y = .fitted),
col = "red", linewidth = 1.25) +
labs(title = "RCS with 4 knots")
p5 <- ggplot(sim_rcs5_aug, aes(x = x, y = y)) +
geom_point(alpha = 0.5) +
geom_line(aes(x = x, y = .fitted),
col = "purple", linewidth = 1.25) +
labs(title = "RCS with 5 knots")
(p0 + p3) / (p4 + p5)

Health Evaluation and Linkage to Primary Care (HELP) was a clinical trial of adult inpatients recruited from a detoxification unit.
| Variable | Description |
|---|---|
| cesd | Center for Epidemiologic Studies-Depression |
| cesd_hi | cesd above 15 (indicates high risk) |
help1 data load

help1 <- tibble(mosaicData::HELPrct) |>
select(id, cesd, age, sex, subst = substance, mcs, pcs, pss_fr) |>
zap_label() |>
mutate(across(where(is.character), as_factor),
id = as.character(id),
cesd_hi = factor(as.numeric(cesd >= 16)))
dim(help1); n_miss(help1)

[1] 453   9
[1] 0
# A tibble: 5 × 9
id cesd age sex subst mcs pcs pss_fr cesd_hi
<chr> <int> <int> <fct> <fct> <dbl> <dbl> <int> <fct>
1 1 49 37 male cocaine 25.1 58.4 0 1
2 2 30 37 male alcohol 26.7 36.0 1 1
3 3 39 26 male heroin 6.76 74.8 13 1
4 4 15 39 female heroin 44.0 61.9 11 0
5 5 39 32 male cocaine 21.7 37.3 10 1
From help1, we'll predict cesd using these six predictors…

| Variable | Description |
|---|---|
| age | subject age (in years) |
| sex | female (n = 107) or male (n = 346) |
| subst | substance abused (alcohol, cocaine, heroin) |
| mcs | SF-36 Mental Component Score |
| pcs | SF-36 Physical Component Score |
| pss_fr | perceived social support by friends |
- What happens when we add a non-linear term?
- Adding an interaction (product term) requires including the main effects of the predictors we are interacting.
- Spearman's \(\rho^2\) is an indicator (not a perfect one) of potential predictive punch, but doesn't give away the game.
- mcs is the most attractive candidate for a non-linear term, as it packs the most potential predictive punch, so if it does turn out to need non-linear terms, our degrees of freedom will be well spent.
- This doesn't mean that mcs actually needs a non-linear term, or will show meaningfully better results if a non-linear term is included. We'd have to fit a model with and without non-linearity in mcs to know that.
- pcs, also quantitative, has the next most potential predictive punch after mcs. pss_fr and sex follow, then subst and age.
Spearman rho^2    Response variable: cesd
rho2 F df1 df2 P Adjusted rho2 n
mcs 0.438 350.89 1 451 0.0000 0.436 453
subst 0.034 7.97 2 450 0.0004 0.030 453
pcs 0.089 44.22 1 451 0.0000 0.087 453
age 0.000 0.12 1 451 0.7286 -0.002 453
sex 0.033 15.56 1 451 0.0001 0.031 453
pss_fr 0.033 15.57 1 451 0.0001 0.031 453
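For a quantitative predictor using a single rank degree of freedom, the rho^2 in this table is just the squared Spearman rank correlation between the predictor and the outcome. A base-R sketch with simulated stand-in data (the variable names mirror help1, but the values are made up):

```r
# rho^2 by hand: square the Spearman correlation, which equals the
# Pearson correlation of the ranks (no ties here, so these agree exactly).
set.seed(432)                          # simulated stand-in, not HELP data
mcs  <- rnorm(453, 31, 12)
cesd <- 60 - 0.9 * mcs + rnorm(453, 0, 9)

rho2_byhand  <- cor(rank(mcs), rank(cesd))^2
rho2_builtin <- cor(mcs, cesd, method = "spearman")^2
all.equal(rho2_byhand, rho2_builtin)
```

Working on the ranks is what makes this screening measure robust to monotone transformations of either variable.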
Here’s a summary of the degrees of freedom for a main effects model without any non-linear terms.
fit1 <- lm(cesd ~ mcs + subst + pcs + age + sex + pss_fr, data = help1)
glance(fit1) |> select(df, df.residual, nobs) |>
gt() |> tab_options(table.font.size = 20) |>
opt_stylize(style = 3, color = "cyan")

| df | df.residual | nobs |
|---|---|---|
| 7 | 445 | 453 |
We started with 453 observations (452 df) and fitting fit1 leaves 445 residual df, so fit1 uses 7 degrees of freedom.
One popular standard for linear regression requires at least 25 observations per regression coefficient that you will estimate.
So we might choose to include non-linear terms in just two or three variables with this modest sample size (n = 453).
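A quick arithmetic check of that guideline against our sample size (the 7 and 13 df figures come from fit1 and the fit2_lm fit shown later):

```r
# With n = 453 and ~25 observations per estimated coefficient,
# this is roughly how many coefficients we can afford.
n_obs  <- 453
budget <- floor(n_obs / 25)
c(budget = budget, fit1_df = 7, fit2_df = 13)
```

Both models stay under the budget of about 18 coefficients, but the margin for additional non-linear terms is limited.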
The fit2 model

Fit a model (fit2) to predict cesd using:

- a 5-knot restricted cubic spline in mcs,
- a 3-knot restricted cubic spline in pcs,
- a quadratic polynomial in pss_fr,
- a linear term in age,
- an interaction of sex with the main effect of mcs (restricting our model so that terms that are non-linear in both sex and mcs are excluded), and
- a main effect of subst.

This is definitely more than we can reasonably do with 453 observations, but let's see how it looks.

The %ia% operator tells R to fit an interaction term with sex and the main effect of mcs. We need sex as a main effect for the interaction term (%ia%) to work, and we already have the main effect of mcs in as part of the spline.

Can we fit fit2 with lm()? Yes. Note the use of poly() in our lm() fit, rather than pol().
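One detail worth knowing about that swap: base R's poly() builds orthogonal polynomials by default, while rms's pol() uses raw powers (poly(..., raw = TRUE) in base R). The two parameterizations give different coefficients but span the same column space, so fitted values agree. A small base-R check on simulated data:

```r
# Orthogonal vs. raw quadratic: different coefficients, identical fits.
set.seed(432)
x <- runif(100, 0, 20)
y <- 5 + 2 * x - 0.1 * x^2 + rnorm(100)

fit_orth <- lm(y ~ poly(x, 2))              # orthogonal basis (default)
fit_raw  <- lm(y ~ poly(x, 2, raw = TRUE))  # raw x and x^2, like pol()

all.equal(fitted(fit_orth), fitted(fit_raw))
```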
fit2_lm <- lm(cesd ~ rcs(mcs, 5) + rcs(pcs, 3) + sex + mcs %ia% sex +
poly(pss_fr,2) + age + subst, data = help1)
glance(fit2_lm) |> select(df, df.residual, nobs) |>
gt() |> tab_options(table.font.size = 20) |>
opt_stylize(style = 3, color = "cyan")

| df | df.residual | nobs |
|---|---|---|
| 13 | 439 | 453 |

fit2_lm uses an additional 6 degrees of freedom beyond the 7 in fit1.

fit2 (from ols())

Linear Regression Model
ols(formula = cesd ~ rcs(mcs, 5) + rcs(pcs, 3) + sex + mcs %ia%
sex + pol(pss_fr, 2) + age + subst, data = help1, x = TRUE,
y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 353.70 R2 0.542
sigma 8.5942 d.f. 13 R2 adj 0.528
d.f. 439 Pr(> chi2) 0.0000 g 10.483
Residuals
Min 1Q Median 3Q Max
-26.89761 -6.02129 0.08325 5.35483 26.41138
Coef S.E. t Pr(>|t|)
Intercept 78.4249 6.3159 12.42 <0.0001
mcs -0.9212 0.2307 -3.99 <0.0001
mcs' 1.6452 2.4951 0.66 0.5100
mcs'' -2.8622 8.3648 -0.34 0.7324
mcs''' 0.2687 7.9109 0.03 0.9729
pcs -0.2437 0.0881 -2.77 0.0059
pcs' -0.0118 0.0997 -0.12 0.9060
sex=male -1.6290 2.5443 -0.64 0.5223
mcs * sex=male -0.0194 0.0781 -0.25 0.8040
pss_fr -1.0436 0.4004 -2.61 0.0095
pss_fr^2 0.0570 0.0280 2.03 0.0425
age -0.0512 0.0568 -0.90 0.3679
subst=cocaine -2.7638 0.9934 -2.78 0.0056
subst=heroin -2.2828 1.0653 -2.14 0.0327
ANOVA for fit2

This ANOVA testing is sequential, other than the TOTALS.
Analysis of Variance Response: cesd
Factor d.f. Partial SS MS F P
mcs (Factor+Higher Order Factors) 5 26857.364671 5371.472934 72.21 <.0001
All Interactions 1 2.026255 2.026255 0.03 0.8690
Nonlinear 3 293.502251 97.834084 1.32 0.2688
pcs 2 2548.388579 1274.194290 17.13 <.0001
Nonlinear 1 1.705031 1.705031 0.02 0.8797
sex (Factor+Higher Order Factors) 2 451.578352 225.789176 3.04 0.0491
All Interactions 1 2.026255 2.026255 0.03 0.8690
mcs * sex (Factor+Higher Order Factors) 1 2.026255 2.026255 0.03 0.8690
pss_fr 1 448.812293 448.812293 6.03 0.0144
age 1 49.758786 49.758786 0.67 0.4139
subst 2 611.625952 305.812976 4.11 0.0170
TOTAL NONLINEAR 4 293.512204 73.378051 0.99 0.4146
TOTAL NONLINEAR + INTERACTION 5 294.601803 58.920361 0.79 0.5558
REGRESSION 12 38058.315322 3171.526277 42.64 <.0001
ERROR 440 32730.174744 74.386761
Validating fit2

index.orig training test optimism index.corrected n
R-square 0.5420 0.5570 0.5259 0.0311 0.5108 300
MSE 71.5769 68.8206 74.0829 -5.2623 76.8393 300
g 10.4826 10.5784 10.3304 0.2480 10.2346 300
Intercept 0.0000 0.0000 0.7935 -0.7935 0.7935 300
Slope 1.0000 1.0000 0.9753 0.0247 0.9753 300
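The table above comes from Harrell's optimism bootstrap. As a hedged base-R sketch of the idea (for R^2 only, on simulated data rather than our fit2): refit on each bootstrap sample, compare that model's apparent R^2 on the bootstrap sample with its R^2 on the original data, and subtract the average gap (the optimism) from the original apparent R^2.

```r
# Optimism-bootstrap correction for R^2, illustrated with a simple lm().
set.seed(432)
n <- 250
x <- runif(n, 0, 100)
y <- 20 + 3 * x + rnorm(n, 0, 100)
dat <- data.frame(x, y)

r2 <- function(fit, data) {
  res <- data$y - predict(fit, newdata = data)
  1 - sum(res^2) / sum((data$y - mean(data$y))^2)
}

orig_fit <- lm(y ~ x, data = dat)
apparent <- r2(orig_fit, dat)

optimism <- mean(replicate(300, {
  boot <- dat[sample(n, replace = TRUE), ]
  bfit <- lm(y ~ x, data = boot)
  r2(bfit, boot) - r2(bfit, dat)   # training R^2 minus test-on-original R^2
}))

c(apparent = apparent, corrected = apparent - optimism)
```

The corrected value is a less biased estimate of how the model would perform on new data from the same process.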
summary results for fit2

Effects Response : cesd
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
mcs 21.676 40.941 19.266 -11.01300 1.22920 -13.42900 -8.59710
pcs 40.384 56.953 16.569 -4.21690 0.73316 -5.65780 -2.77590
pss_fr 3.000 10.000 7.000 -2.12120 0.74667 -3.58870 -0.65369
age 30.000 40.000 10.000 -0.51164 0.56762 -1.62720 0.60394
sex - female:male 2.000 1.000 NA 2.18360 0.99288 0.23218 4.13500
subst - cocaine:alcohol 1.000 2.000 NA -2.76380 0.99343 -4.71630 -0.81134
subst - heroin:alcohol 1.000 3.000 NA -2.28280 1.06530 -4.37640 -0.18915
Adjusted to: mcs=28.60242 sex=male
A nomogram for fit2, predicting cesd. The nomogram shows modeled effects and their impact on the predicted outcome.
Suppose we want to use our model fit2 to make a prediction for cesd for a new subject, named Grace, who has the following characteristics…
We can build point and interval estimates for predicted cesd from fit2 as follows…
grace <- tibble(sex = "female", mcs = 40, pcs = 50,
pss_fr = 7, age = 45, subst = "cocaine")
predict(fit2, newdata = grace, conf.int = 0.95, conf.type = "individual") |>
as_vector()

linear.predictors.1 lower.1 upper.1
26.808537 9.595825 44.021249
Our predicted cesd for Grace is 26.81, with 95% prediction interval (9.60, 44.02).
Predict mean cesd of a set of subjects with Grace’s predictor values, along with a confidence interval.
linear.predictors.1 lower.1 upper.1
26.80854 23.49523 30.12185
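The same individual-versus-mean distinction is available in base R through predict.lm(), where interval = "prediction" gives an interval for one new subject and interval = "confidence" gives an interval for the mean response at those predictor values; the prediction interval is always the wider of the two. An illustrative sketch with simulated data:

```r
# Prediction interval (one individual) vs. confidence interval (the mean):
# same point estimate, but the prediction interval is wider.
set.seed(432)
x <- runif(200, 0, 60)
y <- 10 + 0.5 * x + rnorm(200, 0, 8)
fit <- lm(y ~ x)
new <- data.frame(x = 40)

pi <- predict(fit, newdata = new, interval = "prediction", level = 0.95)
ci <- predict(fit, newdata = new, interval = "confidence", level = 0.95)

rbind(prediction = pi, confidence = ci)
```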
Calibration of fit2

We would like our model to be well-calibrated, in the following sense…

We want agreement between our observed cesd outcome and our predicted cesd from the model. To check calibration for this fit, we need to have specified x = TRUE, y = TRUE in a fit with ols().
n=453 Mean absolute error=0.363 Mean squared error=0.17069
0.9 Quantile of absolute error=0.627
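As a hedged illustration of what miscalibration looks like (this is not our fit2): an overfit model makes overly extreme predictions, so when we regress held-out outcomes on its predictions, the calibration slope falls below 1.

```r
# Overfit a degree-8 polynomial on a small training sample, then estimate
# the calibration slope on held-out data from the same process.
set.seed(432)
xtr <- runif(40, 0, 10)
train <- data.frame(x = xtr, y = 3 + 2 * xtr + rnorm(40, 0, 6))
xte <- runif(360, min(xtr), max(xtr))   # stay inside the training range
test <- data.frame(x = xte, y = 3 + 2 * xte + rnorm(360, 0, 6))

overfit <- lm(y ~ poly(x, 8), data = train)
pred    <- predict(overfit, newdata = test)
slope   <- unname(coef(lm(test$y ~ pred))[2])
slope   # a well-calibrated model would give a slope near 1
```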
# Check for Multicollinearity
Low Correlation
Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
rcs(pcs, 3) 1.31 [ 1.19, 1.49] 1.14 0.76 [0.67, 0.84]
poly(pss_fr, 2) 1.09 [ 1.03, 1.29] 1.04 0.92 [0.78, 0.97]
age 1.17 [ 1.09, 1.34] 1.08 0.85 [0.74, 0.92]
subst 1.22 [ 1.12, 1.39] 1.10 0.82 [0.72, 0.89]
Moderate Correlation
Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
rcs(mcs, 5) 5.46 [ 4.66, 6.43] 2.34 0.18 [0.16, 0.21]
sex 7.16 [ 6.09, 8.47] 2.68 0.14 [0.12, 0.16]
High Correlation
Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
mcs %ia% sex 11.85 [10.01, 14.06] 3.44 0.08 [0.07, 0.10]
The collinearity plot is a bit hard to see with all of these terms, so we can just look at the variance inflation factors:
mcs mcs' mcs'' mcs''' pcs
53.711079 4838.370091 12475.902431 2489.147506 5.521090
pcs' sex=male mcs * sex=male pss_fr pss_fr^2
5.365910 7.163012 11.848760 15.657046 15.885078
age subst=cocaine subst=heroin
1.172137 1.349517 1.383641
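As a reminder of what these numbers measure, a variance inflation factor is 1/(1 - R^2_j), where R^2_j comes from regressing predictor j on all the other predictors. A base-R sketch on simulated data (not our help1 variables):

```r
# VIF by hand: regress each predictor on the others, then 1 / (1 - R^2).
set.seed(432)
n  <- 453
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, 0, 0.6)   # deliberately correlated with x1
x3 <- rnorm(n)                      # independent of the others

vif_byhand <- function(j, X) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
}
X <- cbind(x1, x2, x3)
sapply(1:3, vif_byhand, X = X)      # x1 and x2 inflated; x3 near 1
```

A VIF near 1 means the other predictors explain almost none of that predictor's variance; large values mean its coefficient's standard error is inflated by collinearity.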
Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.048).
OK: residuals appear as normally distributed (p = 0.986).
OK: No outliers detected.
- Based on the following method and threshold: cook (0.9).
- For variable: (Whole model)
Should we prefer fit2 to fit1?

We can use either lm() or ols() to fit a model like fit2. To delve into the details of how well this complex model works, and to help plot what is actually being fit, we'll want to fit the model using ols(). Some results are most easily obtained using lm(), and others are most easily obtained using ols().

432 Class 07 | 2025-02-04 | https://thomaselove.github.io/432-2025/